30 - Logic-Based Natural Language Semantics (WS 23/24) [ID:51528]

So today, we want to switch gears.

So far, we've mainly tried to understand topics, special phenomena of natural language.

How do verb phrases work?

How does discourse work?

Modalities.

All of those kind of things.

And last Thursday we looked at some other topics, basically

some of the phenomena that a particular natural language, namely mathematical natural language, exhibits.

What we want to do today, basically as a counterpoint, is look at another important topic in natural language,

namely, how do I deal with big corpora?

Okay, how can we find what actually exists?

I mean, the stuff we've been doing is kind of deep and somewhat narrow.

What would we do if we wanted to scale up to a bigger corpus?

And we are going to look at stuff that we're doing for research: Frederick and I

basically want to see how we can deal with bigger corpora.

There are two corpora we're actually working with and interested in.

You may have seen arXiv.org.

So, hope you have not seen archive.org.

arXiv.org is a project that's been going on since the 1990s, I think 92 or something like this,

where a guy from NASA or something like this, I don't quite remember anymore,

started collecting preprints, scientific preprints, mostly on,

I think, quantum physics and relativity and so on.

And preprints are basically where scientists, before they publish something, put it out to the community.

And that's been made more sustainable, and they now have 2.4 million articles.

And the community has organized around this, and in some parts of

physics and some parts of computer science, whenever you've written something,

you put it onto arXiv even before you publish it. And if you put it up on arXiv, you can kind of

tick a box and say, oh yes, and please submit this to the Journal of High Energy Physics or something.

And then it's automatically considered, but everybody can look at it, even though it's still

under review and will only be published, say, nine months from now.

So this is a very nice little thing. And for some areas like astrophysics, it's actually complete

because essentially every article is being published here first. And if

we look at that, you can see here's an article, apparently about, I don't know, something in the sky.

That's just how those articles look. So we basically have a good record of scientific articles, 2.4 million of them.

The problem with this,

no, that's not why I want it, right, is that these are PDFs. PDFs are spectacularly unhelpful

because they basically describe where dots go on pages, where ink goes on pages.

But you can also look at,

no, you basically can look at the LaTeX sources. And one of the things that

my group has been doing over the last, well, it's now also 20 years, is actually converting them into

HTML5. That's the problem: this one is actually too young. We haven't gotten around to converting it.

And I don't want to, I know there's something we can look at. This thing here is actually HTML5,

which is much easier to parse. And we can do research on this. We can see formulae. We can

see all kinds of things. And one thing you can't do with a PDF is the following, namely,

look at it on your tablet or on your smartphone.

Right. And very importantly, we have the thing in HTML. You probably know that

LaTeX is very difficult to parse. PDF is even more horrible.
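
To make the point about HTML5 being easy to parse a bit more concrete, here is a minimal sketch that pulls the formulae out of an HTML5 page using only Python's standard library. It assumes the formulae are marked up as `<math>` elements; the class `MathCollector`, the function `extract_formulae` and the `SAMPLE` snippet are made up for illustration and are not the tooling used for the corpus discussed in the lecture.

```python
from html.parser import HTMLParser


class MathCollector(HTMLParser):
    """Collect the text content of every <math> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how deeply we are nested inside <math>
        self.current = []    # text pieces of the formula being read
        self.formulae = []   # finished formulae, one string each

    def handle_starttag(self, tag, attrs):
        if tag == "math":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "math" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <math> just closed: store the formula text.
                self.formulae.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        # Only keep text that occurs inside a formula.
        if self.depth > 0:
            self.current.append(data.strip())


def extract_formulae(html):
    """Return the flattened text of each <math> element in the document."""
    parser = MathCollector()
    parser.feed(html)
    parser.close()
    return parser.formulae


# Invented sample input, standing in for a converted article.
SAMPLE = (
    "<p>The energy is <math><mi>E</mi><mo>=</mo><mi>m</mi>"
    "<msup><mi>c</mi><mn>2</mn></msup></math>.</p>"
)

if __name__ == "__main__":
    for formula in extract_formulae(SAMPLE):
        print(formula)
```

Doing the same on PDF would mean guessing formulae from glyph positions on the page, which is why the structured markup matters for corpus-scale work.
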

This is something we can work with, in particular, even including the math bits,

which you can see, these are all the math formulae, the long stuff,

which you can actually automatically do things with. Okay. We have this very nice

Accessible via

Open access

Duration

01:31:08 Min

Recording date

2024-02-06

Uploaded on

2024-02-06 16:36:04

Language

en-US
